How Global Temperatures have changed since the Industrial Revolution and their relationship with GDP
Authors
Suraj Karakulath
Modou Lamin Manjang (Momo)
Gillian Baez Colon
Published
December 1, 2022
Abstract
This report analyzes global temperature changes using deviations from a baseline and determines the relationship between GDP and temperature deviation via a linear model. Temperatures have increased globally throughout the years with a sharper increase in the last 50 years. The relationship between global temperatures and GDP is not entirely clear.
Introduction
Climate change refers to the long-term shifts in weather and temperature patterns driven by human activities. The effects of these shifts include catastrophic natural disasters, extinctions, forced migrations, and more. While there is an overwhelming amount of climate data available, climate data science has yet to see significant achievements because of the complex nature of the data. Nevertheless, the field is progressing and developing models to understand and predict climate change.
Berkeley Earth put together the data set, which compiles three of the most cited land and ocean temperature data sets: NOAA’s MLOST, NASA’s GISTEMP, and the UK’s HadCrut. It combines 1.6 billion temperature reports from 16 pre-existing archives, including Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv), which tracks the average land and ocean temperatures from 1750 along with their uncertainties and their maximums and minimums (from 1850) until 2015. Additionally, it contains sheets that slice the data by country, state, and city.
We intend to explore how temperatures have risen worldwide since 1750 and which regions (countries) have experienced the most extreme changes using the land and ocean temperature variables and country names. Then, we plan to determine the relationship between a country’s GDP and climate change using temperature variables.
Our team selected these research questions because they work with the complex spatiotemporal nature of the data, which is often an obstacle when incorporating traditional data science methods and principles. While it is a common belief that temperatures have risen and will continue to do so (global warming), it is also likely that the fall and winter seasons will become colder. Our goal is to visualize these deviations in temperature since the Industrial Revolution and determine which regions have been affected the most. We hypothesize that temperatures have deviated the most in the last 50 years and countries with lower GDPs have been affected the most.
Data
Data source: https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
Rows: 577462 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country
dbl (2): AverageTemperature, AverageTemperatureUncertainty
date (1): dt
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
global_temp <- read_csv("GlobalTemperatures.csv")
Rows: 3192 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): LandAverageTemperature, LandAverageTemperatureUncertainty, LandMax...
date (1): dt
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning and Exploration
Code
head(global_temp)
# A tibble: 6 × 9
dt LandAvera…¹ LandA…² LandM…³ LandM…⁴ LandM…⁵ LandM…⁶ LandA…⁷ LandA…⁸
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1750-01-01 3.03 3.57 NA NA NA NA NA NA
2 1750-02-01 3.08 3.70 NA NA NA NA NA NA
3 1750-03-01 5.63 3.08 NA NA NA NA NA NA
4 1750-04-01 8.49 2.45 NA NA NA NA NA NA
5 1750-05-01 11.6 2.07 NA NA NA NA NA NA
6 1750-06-01 12.9 1.72 NA NA NA NA NA NA
# … with abbreviated variable names ¹LandAverageTemperature,
# ²LandAverageTemperatureUncertainty, ³LandMaxTemperature,
# ⁴LandMaxTemperatureUncertainty, ⁵LandMinTemperature,
# ⁶LandMinTemperatureUncertainty, ⁷LandAndOceanAverageTemperature,
# ⁸LandAndOceanAverageTemperatureUncertainty
Code
dim(global_temp) #3192 x 9
[1] 3192 9
The global_temp data frame contains 3192 observations of 9 variables.
Code
str(global_temp)
spc_tbl_ [3,192 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ dt : Date[1:3192], format: "1750-01-01" "1750-02-01" ...
$ LandAverageTemperature : num [1:3192] 3.03 3.08 5.63 8.49 11.57 ...
$ LandAverageTemperatureUncertainty : num [1:3192] 3.57 3.7 3.08 2.45 2.07 ...
$ LandMaxTemperature : num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
$ LandMaxTemperatureUncertainty : num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
$ LandMinTemperature : num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
$ LandMinTemperatureUncertainty : num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
$ LandAndOceanAverageTemperature : num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
$ LandAndOceanAverageTemperatureUncertainty: num [1:3192] NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. dt = col_date(format = ""),
.. LandAverageTemperature = col_double(),
.. LandAverageTemperatureUncertainty = col_double(),
.. LandMaxTemperature = col_double(),
.. LandMaxTemperatureUncertainty = col_double(),
.. LandMinTemperature = col_double(),
.. LandMinTemperatureUncertainty = col_double(),
.. LandAndOceanAverageTemperature = col_double(),
.. LandAndOceanAverageTemperatureUncertainty = col_double()
.. )
- attr(*, "problems")=<externalptr>
The dt variable is already of type Date, while all the others are numeric: the temperatures and their uncertainties.
Code
summary(global_temp)
dt LandAverageTemperature LandAverageTemperatureUncertainty
Min. :1750-01-01 Min. :-2.080 Min. :0.0340
1st Qu.:1816-06-23 1st Qu.: 4.312 1st Qu.:0.1867
Median :1882-12-16 Median : 8.611 Median :0.3920
Mean :1882-12-15 Mean : 8.375 Mean :0.9385
3rd Qu.:1949-06-08 3rd Qu.:12.548 3rd Qu.:1.4192
Max. :2015-12-01 Max. :19.021 Max. :7.8800
NA's :12 NA's :12
LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
Min. : 5.90 Min. :0.0440 Min. :-5.407
1st Qu.:10.21 1st Qu.:0.1420 1st Qu.:-1.335
Median :14.76 Median :0.2520 Median : 2.950
Mean :14.35 Mean :0.4798 Mean : 2.744
3rd Qu.:18.45 3rd Qu.:0.5390 3rd Qu.: 6.779
Max. :21.32 Max. :4.3730 Max. : 9.715
NA's :1200 NA's :1200 NA's :1200
LandMinTemperatureUncertainty LandAndOceanAverageTemperature
Min. :0.0450 Min. :12.47
1st Qu.:0.1550 1st Qu.:14.05
Median :0.2790 Median :15.25
Mean :0.4318 Mean :15.21
3rd Qu.:0.4582 3rd Qu.:16.40
Max. :3.4980 Max. :17.61
NA's :1200 NA's :1200
LandAndOceanAverageTemperatureUncertainty
Min. :0.0420
1st Qu.:0.0630
Median :0.1220
Mean :0.1285
3rd Qu.:0.1510
Max. :0.4570
NA's :1200
The dates range from 1750-01-01 to 2015-12-01. So (2015-1750)*12+12 = 3192 dates (for every month), which matches our number of rows in the dataset. We have one unique record for every month.
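The count above can be reproduced directly. The following is a minimal sketch in base R that builds the month-start sequence for the stated date range and compares its length with the closed-form arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Build every month start between the first and last dates in the dataset.
months <- seq(as.Date("1750-01-01"), as.Date("2015-12-01"), by = "month")

# Number of monthly dates in the range.
n_months <- length(months)
n_months

# The closed-form count used in the text.
closed_form <- (2015 - 1750) * 12 + 12
closed_form
```

If the dataset really has one record per month, `nrow(global_temp)` should equal `n_months` and `anyDuplicated(global_temp$dt)` should be 0.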
There are missing values for all variables except date. LandAverageTemperature and its uncertainty have 12 missing values each, while the rest of the temperature columns have 1200 missing values each.
We want to find out which are the dates for which there are missing values.
First, the missing values for LandAverageTemperature:
# A tibble: 12 × 9
dt LandAver…¹ LandA…² LandM…³ LandM…⁴ LandM…⁵ LandM…⁶ LandA…⁷ LandA…⁸
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1750-11-01 NA NA NA NA NA NA NA NA
2 1751-05-01 NA NA NA NA NA NA NA NA
3 1751-07-01 NA NA NA NA NA NA NA NA
4 1751-10-01 NA NA NA NA NA NA NA NA
5 1751-11-01 NA NA NA NA NA NA NA NA
6 1751-12-01 NA NA NA NA NA NA NA NA
7 1752-02-01 NA NA NA NA NA NA NA NA
8 1752-05-01 NA NA NA NA NA NA NA NA
9 1752-06-01 NA NA NA NA NA NA NA NA
10 1752-07-01 NA NA NA NA NA NA NA NA
11 1752-08-01 NA NA NA NA NA NA NA NA
12 1752-09-01 NA NA NA NA NA NA NA NA
# … with abbreviated variable names ¹LandAverageTemperature,
# ²LandAverageTemperatureUncertainty, ³LandMaxTemperature,
# ⁴LandMaxTemperatureUncertainty, ⁵LandMinTemperature,
# ⁶LandMinTemperatureUncertainty, ⁷LandAndOceanAverageTemperature,
# ⁸LandAndOceanAverageTemperatureUncertainty
They appear to be scattered dates: November 1750; May, July, October, November and December 1751; and February and May through September 1752. All 12 are shown above. These are only a few entries compared with the total number of observations, and when we do use this column later, it will usually be in aggregation functions like mean, where the na.rm = TRUE argument removes them.
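The code that produced the table of missing rows is not shown in the report. A minimal sketch of one way to list such rows follows, using a small made-up data frame (`toy_temp`, with illustrative values rather than real records) in place of global_temp.

```r
library(dplyr)

# Toy stand-in for global_temp; the values are illustrative only.
toy_temp <- data.frame(
  dt = as.Date(c("1750-10-01", "1750-11-01", "1750-12-01")),
  LandAverageTemperature = c(6.37, NA, 2.77)
)

# Keep only the rows where the temperature is missing.
missing_rows <- toy_temp |>
  filter(is.na(LandAverageTemperature))
missing_rows$dt

# Aggregations skip the gaps when na.rm = TRUE.
mean_without_na <- mean(toy_temp$LandAverageTemperature, na.rm = TRUE)
mean_without_na
```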
For the LandMaxTemperature, the 1200 missing values are:
# A tibble: 1,200 × 9
dt LandAver…¹ LandA…² LandM…³ LandM…⁴ LandM…⁵ LandM…⁶ LandA…⁷ LandA…⁸
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1750-01-01 3.03 3.57 NA NA NA NA NA NA
2 1750-02-01 3.08 3.70 NA NA NA NA NA NA
3 1750-03-01 5.63 3.08 NA NA NA NA NA NA
4 1750-04-01 8.49 2.45 NA NA NA NA NA NA
5 1750-05-01 11.6 2.07 NA NA NA NA NA NA
6 1750-06-01 12.9 1.72 NA NA NA NA NA NA
7 1750-07-01 15.9 1.91 NA NA NA NA NA NA
8 1750-08-01 14.8 2.23 NA NA NA NA NA NA
9 1750-09-01 11.4 2.64 NA NA NA NA NA NA
10 1750-10-01 6.37 2.67 NA NA NA NA NA NA
# … with 1,190 more rows, and abbreviated variable names
# ¹LandAverageTemperature, ²LandAverageTemperatureUncertainty,
# ³LandMaxTemperature, ⁴LandMaxTemperatureUncertainty, ⁵LandMinTemperature,
# ⁶LandMinTemperatureUncertainty, ⁷LandAndOceanAverageTemperature,
# ⁸LandAndOceanAverageTemperatureUncertainty
These are all the dates from Jan 1750 through Dec 1849, for a total of 1200 rows: (1849-1750)*12+12 = 1200.
All of these have the other temperatures also missing.
We can remove these rows if we are working with these variables, but we are not going to be using the LandMaxTemperature anyway. So for now we will keep it to see how average temperatures have risen over the years.
Visualizations
How have temperatures risen across the world since 1750?
We hypothesize that global temperatures have risen since 1750, with a sharper increase in the last 50 years. We will confirm this by plotting the LandAverageTemperature against the date.
There is a visible increase in the years from 1900 to 2000 and beyond, with 4-5 “bands” that most likely indicate seasons. The data from 1750 to 1850 is noisier and more uncertain, which may be the result of unstandardized data collection and multiple sources.
This displays the entire dataset with temperatures for every month, which fluctuate due to season in every location. We can smooth it out by breaking it down by year and decade or comparing months across the years.
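As a rough illustration of the yearly breakdown, the sketch below averages a monthly series by calendar year. The data frame `toy_temp` and its values are made up for the example; the column names follow global_temp.

```r
library(dplyr)

# Toy monthly series covering two years (illustrative values only).
toy_temp <- data.frame(
  dt = seq(as.Date("1900-01-01"), by = "month", length.out = 24),
  LandAverageTemperature = c(rep(8, 12), rep(9, 12))
)

# Extract the year from each date and average the months within it.
yearly <- toy_temp |>
  mutate(year = as.integer(format(dt, "%Y"))) |>
  group_by(year) |>
  summarise(mean_temp = mean(LandAverageTemperature, na.rm = TRUE))
yearly
```

Averaging over a full year removes the seasonal cycle, which is what makes the long-term trend easier to see in a plot.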
There is a clear rise in the average temperature over the last two centuries and a steep rise in the last 10-20 years. This lends some support to our hypothesis that the sharpest increase has occurred within the last 50 years.
It is clear that global temperatures have risen and will likely continue to do so. We want to explore precisely how much they have been increasing. For this, we need a metric to identify how far the LandAverageTemperature has deviated from a baseline that makes sense.
Climate science is a constantly evolving field; different models use different baselines. The baseline period is chosen depending on the specific research question addressed, the availability of data for that time, and the ability of the model to accurately simulate climate conditions over that period.
Some studies use the first 100 years of available data as a baseline, while other more recent climate models, such as paleoclimate ones, use pre-Industrial Revolution years as a baseline. The latter choice accounts for the significant effects of industrialization on climate.
We selected the average of the first 100 years of data (1750 to 1849) as an initial baseline to compare how the average temperatures have risen since 1850.
Code
first <- global_temp |>
  filter(dt >= '1750-01-01' & dt <= '1849-12-01')
global_mean <- mean(first$LandAverageTemperature, na.rm = TRUE)
Every decade the departure from baseline (on average for that decade) has increased.
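The decade-level departures are not shown as code in the report. A minimal sketch of how they might be computed follows, assuming a departure is the monthly temperature minus the baseline mean. The data frame `toy_temp`, its values, and the names `baseline_mean` and `by_decade` are illustrative assumptions.

```r
library(dplyr)

# Toy series: two baseline-era months and two later months (made-up values).
toy_temp <- data.frame(
  dt = as.Date(c("1750-01-01", "1849-12-01", "1955-06-01", "1965-06-01")),
  LandAverageTemperature = c(8.0, 8.4, 8.7, 9.1)
)

# Baseline: mean temperature over 1750-01 through 1849-12.
baseline <- toy_temp |>
  filter(dt >= as.Date("1750-01-01"), dt <= as.Date("1849-12-01"))
baseline_mean <- mean(baseline$LandAverageTemperature, na.rm = TRUE)

# Departure of each later month from the baseline, averaged per decade.
by_decade <- toy_temp |>
  filter(dt > as.Date("1849-12-01")) |>
  mutate(
    departure = LandAverageTemperature - baseline_mean,
    decade = (as.integer(format(dt, "%Y")) %/% 10) * 10
  ) |>
  group_by(decade) |>
  summarise(mean_departure = mean(departure, na.rm = TRUE))
by_decade
```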
Which countries have departed from the baseline the most?
For this, we explore the country dataset.
Code
#| label: summary
summary(global_temp_country)
dt AverageTemperature AverageTemperatureUncertainty
Min. :1743-11-01 Min. :-37.66 Min. : 0.05
1st Qu.:1862-12-01 1st Qu.: 10.03 1st Qu.: 0.32
Median :1914-04-01 Median : 20.90 Median : 0.57
Mean :1909-04-11 Mean : 17.19 Mean : 1.02
3rd Qu.:1964-03-01 3rd Qu.: 25.81 3rd Qu.: 1.21
Max. :2013-09-01 Max. : 38.84 Max. :15.00
NA's :32651 NA's :31912
Country
Length:577462
Class :character
Mode :character
The dataset has only 4 variables: 1 date, 2 numeric (the temperature and its uncertainty), and 1 character (the country name). The dates start earlier, in November 1743, and end in September 2013.
Code
length(unique(global_temp_country$Country)) #243
[1] 243
There are 243 distinct Country values, each with monthly AverageTemperature records over that time period (with some missing values).
dt AverageTemperature
0 32651
AverageTemperatureUncertainty Country
31912 0
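Per-column counts of missing values like those above can be obtained in base R with `colSums(is.na(...))`. The sketch below applies it to a small made-up data frame (`toy_country`, illustrative values) with the same column names as global_temp_country; it is not necessarily the call the report used.

```r
# Toy stand-in for global_temp_country; the values are illustrative only.
toy_country <- data.frame(
  dt = as.Date(c("1743-11-01", "1743-12-01", "1744-01-01")),
  AverageTemperature = c(4.4, NA, NA),
  AverageTemperatureUncertainty = c(2.3, NA, 1.9),
  Country = c("Denmark", "Denmark", "Denmark")
)

# is.na() gives a logical matrix; summing each column counts the NAs.
na_counts <- colSums(is.na(toy_country))
na_counts
```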
Calculating departures from the baseline for each country
Rather than only measuring each country against the global mean, we explore how much each country has deviated from its own baseline, since different regions have different climate characteristics (e.g. a meaningful baseline for Russia may be too cold for Qatar, and the global mean may not be a meaningful baseline for Greenland).
We calculate each country's departure from its own mean AverageTemperature over the baseline years (before 1850).
Code
library(tidyr)
# Finding the baseline mean for each country first
country_temp_means <- global_temp_country |>
  filter(dt <= '1849-12-01') |>
  group_by(Country) |>
  summarise(country_mean = mean(AverageTemperature, na.rm = TRUE))
Now ranking the countries with the maximum departure from their own mean for the baseline years.
# A tibble: 243 × 2
Country max_departure
<chr> <dbl>
1 Russia 23.0
2 Mongolia 22.1
3 Kazakhstan 20.9
4 Canada 20.7
5 Greenland 19.4
6 Denmark 19.2
7 Uzbekistan 19.0
8 Turkmenistan 18.2
9 Finland 18.2
10 Estonia 17.8
# … with 233 more rows
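The ranking step itself is not shown in the report. One plausible reconstruction is sketched below on made-up data: join each country's baseline mean back onto its monthly records, subtract, and take the maximum per country. The data frame `toy_country`, its values, and the intermediate names are assumptions for illustration.

```r
library(dplyr)

# Toy records for two hypothetical countries (illustrative values only).
toy_country <- data.frame(
  dt = as.Date(rep(c("1800-01-01", "1800-07-01", "2000-01-01", "2000-07-01"), 2)),
  Country = rep(c("A", "B"), each = 4),
  AverageTemperature = c(-10, 10, -8, 14, 24, 26, 25, 27)
)

# Baseline mean per country, using only the years before 1850.
country_means <- toy_country |>
  filter(dt <= as.Date("1849-12-01")) |>
  group_by(Country) |>
  summarise(country_mean = mean(AverageTemperature, na.rm = TRUE))

# Departure of every record from its country's baseline, then the maximum.
ranking <- toy_country |>
  left_join(country_means, by = "Country") |>
  mutate(departure = AverageTemperature - country_mean) |>
  group_by(Country) |>
  summarise(max_departure = max(departure, na.rm = TRUE)) |>
  arrange(desc(max_departure))
ranking
```

Under this construction, monthly values are compared against an all-months mean, so a large seasonal range also raises max_departure. Comparing each month against the baseline for the same calendar month would be one way to separate the two effects.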
Code
#had to suppress warnings due to a complex warning message "no non-missing arguments to min" etc.
The countries that have the highest departures from their own baseline turn out to be some of the colder countries, suggesting that these regions are seeing the biggest impact of climate change.
As seen in each of the plots, the departures from each country's baseline have been increasing steadily since 1970.
Which latitudes and longitudes have the highest departure from the baseline?
Let’s explore the temperature changes at different latitudes and longitudes. For this we import a dataset that maps country codes with latitudes and longitudes.
[1] "Åland"
[2] "Africa"
[3] "Antigua And Barbuda"
[4] "Asia"
[5] "Baker Island"
[6] "Bonaire, Saint Eustatius And Saba"
[7] "Bosnia And Herzegovina"
[8] "Burma"
[9] "Côte D'Ivoire"
[10] "Congo (Democratic Republic Of The)"
[11] "Congo"
[12] "Curaçao"
[13] "Denmark (Europe)"
[14] "Europe"
[15] "Falkland Islands (Islas Malvinas)"
[16] "Federated States Of Micronesia"
[17] "France (Europe)"
[18] "French Southern And Antarctic Lands"
[19] "Guinea Bissau"
[20] "Heard Island And Mcdonald Islands"
[21] "Isle Of Man"
[22] "Kingman Reef"
[23] "Macedonia"
[24] "Netherlands (Europe)"
[25] "North America"
[26] "Oceania"
[27] "Palestina"
[28] "Palmyra Atoll"
[29] "Reunion"
[30] "Saint Barthélemy"
[31] "Saint Kitts And Nevis"
[32] "Saint Martin"
[33] "Saint Pierre And Miquelon"
[34] "Saint Vincent And The Grenadines"
[35] "Sao Tome And Principe"
[36] "Sint Maarten"
[37] "South America"
[38] "South Georgia And The South Sandwich Isla"
[39] "Svalbard And Jan Mayen"
[40] "Timor Leste"
[41] "Trinidad And Tobago"
[42] "Turks And Caicas Islands"
[43] "United Kingdom (Europe)"
[44] "Virgin Islands"
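A list of unmatched names like the one above can be produced by comparing the two sets of country names. A minimal sketch with base R's `setdiff` follows; the two vectors are short made-up examples, not the actual contents of either dataset.

```r
# Hypothetical name vectors standing in for the two datasets.
temp_countries  <- c("Burma", "Denmark", "Denmark (Europe)", "Europe", "Finland")
coord_countries <- c("Denmark", "Finland", "Myanmar")

# Names present in the temperature data but absent from the coordinates data.
unmatched <- sort(setdiff(temp_countries, coord_countries))
unmatched
```

With data frames, `dplyr::anti_join()` on the country column gives the same information while keeping the unmatched rows.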
There are a few “countries” in our original dataset that are missing from the coordinates dataset. These fall into the following cases:
Åland, an autonomous region of Finland (best removed if it has temperature records for the same dates as Finland).
Caribbean entries like Antigua And Barbuda; Bonaire, Saint Eustatius And Saba; and Curaçao (which have their own particular circumstances, such as being Dutch territories, so they are best removed), as well as French Southern And Antarctic Lands (an overseas territory of France).
Islands like Baker Island (which can be removed).
Entire continents aggregated, like Africa, Europe and Asia (which can be removed as we are only looking at individual countries).
Minor differences in spelling or naming, all of which are corrected to their simplest versions: the capitalisation of "And" in Bosnia And Herzegovina; Côte D’Ivoire, which is also spelled Côte d’Ivoire; Congo, which is variously named as just Congo, Democratic Republic and other variations; Denmark and France, which are also included as Denmark (Europe) and France (Europe); Falkland Islands (Islas Malvinas) and its square-bracket variation; Micronesia, which also appears as Federated States Of Micronesia; and Burma, which was the name of Myanmar until 1989.
Code
# Strip the " (Europe)" suffix; the space before the parenthesis is part of the name
country_dep$Country <- gsub("France \\(Europe\\)", "France", country_dep$Country)
country_dep$Country <- gsub("Denmark \\(Europe\\)", "Denmark", country_dep$Country)
country_dep$Country <- gsub("Netherlands \\(Europe\\)", "Netherlands", country_dep$Country)
country_dep$Country <- gsub("United Kingdom \\(Europe\\)", "United Kingdom", country_dep$Country)
# Align the remaining names with the spelling used in the coordinates dataset
country_dep$Country <- gsub("Federated States Of Micronesia", "Micronesia", country_dep$Country)
country_dep$Country <- gsub("Burma", "Myanmar", country_dep$Country)
country_dep$Country <- gsub("Congo", "Congo", country_dep$Country) # no-op as written: replaces "Congo" with itself
country_dep$Country <- gsub("Bosnia And Herzegovina", "Bosnia and Herzegovina", country_dep$Country)
country_dep$Country <- gsub("Côte D'Ivoire", "Côte d'Ivoire", country_dep$Country)
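Names such as "France (Europe)" contain a space before the parenthesis, so any pattern meant to match them has to include that space. The sketch below, on a made-up vector, shows a pattern without the space matching nothing, and a literal (`fixed = TRUE`) alternative that strips the suffix from every name at once. This is an illustrative alternative, not the report's code.

```r
# Hypothetical sample of country names.
names_raw <- c("France (Europe)", "France", "Denmark (Europe)")

# A regex without the space before "(" leaves the names untouched.
no_space <- gsub("France\\(Europe\\)", "France", names_raw)
no_space

# Literal matching avoids escaping and handles every "(Europe)" suffix.
names_clean <- sub(" (Europe)", "", names_raw, fixed = TRUE)
names_clean
```

Note that after cleaning, "France (Europe)" and "France" share one name, so their records would need to be de-duplicated or aggregated before joining.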
What does the temperature change look like on a global scale over the years?
For this, we use an interactive world map that places each country at its latitude and longitude, uses a color gradient to show the temperature change (the average of departures for each year), and exposes the time dimension (year) as a movable slider.
Code
with_dep_by_year <- country_temps |>
  group_by(country, yearDt) |>
  summarise(departure_by_year = mean(departure, na.rm = TRUE)) |>
  suppressMessages()
# The above creates a data frame with country, yearDt and the average of departures for that year for each country.
# Now all we need is the latitude and longitude coordinates to map back to the country
map_data <- country_coordinates |>
  left_join(with_dep_by_year, by = c("country" = "country")) |>
  select(latitude, longitude, country, yearDt, departure_by_year)
Code
suppressPackageStartupMessages(library(plotly))
# Animated map: one marker per country, coloured by that year's mean departure
plot_geo(map_data, lon = ~longitude, lat = ~latitude,
         color = ~departure_by_year, frame = ~yearDt)
No scattergeo mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Warning: Ignoring 1 observations
As the years go by, all the dark spots turn to green or closer to green, suggesting that on average the departures from the baseline for each country are increasing every year.
Analysis
In this section, we use a linear model to determine whether there is any correlation between a country’s GDP and its deviation from the baseline temperature. We import a new dataset for GDP and clean it so that its country names match ours.
# There is an "X" in front of each year in the dataset, so we remove the first character
dta1$Year <- substring(dta1$Year, 2)
dta1 <- dta1 %>% rename(country = Country.Name)
dta2 <- dta1[, -2]  # drop the second column
Now that we have one measure of GDP for each country, we need one single measure of departure to find out the relation between GDP and temperature change.
For this single measure of departure, we average all departures from 1950 until the end.
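The averaging step can be sketched as follows, assuming a data frame shaped like with_dep_by_year (one row per country and year). The data frame `toy_dep`, its values, and the name `avg_dep` are made up for illustration.

```r
library(dplyr)

# Toy yearly departures for two hypothetical countries.
toy_dep <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  yearDt = c(1940, 1950, 2000, 1960, 2010),
  departure_by_year = c(0.1, 0.5, 1.5, 0.2, 0.6)
)

# Keep 1950 onward and collapse to one average departure per country.
avg_dep <- toy_dep |>
  filter(yearDt >= 1950) |>
  group_by(country) |>
  summarise(avg_departure = mean(departure_by_year, na.rm = TRUE))
avg_dep
```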
The linear model of departures on GDP does not show a clear linear relationship. The estimated coefficient, 2.971e-14, is very small. Part of that is simply scale, since the coefficient measures the change in departure per single unit of GDP, but overall the fit suggests that GDP alone is not a strong predictor of the average temperature departure.
There could be many other factors, such as HDI, that influence the extent to which temperatures deviate from the baseline. In other words, we will need a more complex model.
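The model code is not shown in the report. The sketch below fits the same kind of model, `lm(avg_departure ~ gdp)`, on a small made-up data frame (`toy`, illustrative values) and refits it with GDP expressed in trillions. It is meant only to show how the size of the coefficient depends on the units of GDP while the quality of the fit does not.

```r
# Toy data: GDP in raw currency units and an average departure per country.
# All values are invented for illustration.
toy <- data.frame(
  gdp = c(2e10, 5e10, 1e11, 5e11, 1e12, 5e12, 1.5e13),
  avg_departure = c(0.9, 1.1, 0.8, 1.2, 1.0, 1.1, 1.3)
)

# Same model with GDP in raw units and in trillions.
fit_raw    <- lm(avg_departure ~ gdp, data = toy)
fit_scaled <- lm(avg_departure ~ I(gdp / 1e12), data = toy)

# The slope changes by the scaling factor ...
slope_raw    <- coef(fit_raw)[["gdp"]]
slope_scaled <- coef(fit_scaled)[[2]]
c(raw = slope_raw, per_trillion = slope_scaled)

# ... but R-squared is identical, so the fit itself is unchanged.
r2_raw    <- summary(fit_raw)$r.squared
r2_scaled <- summary(fit_scaled)$r.squared
c(raw = r2_raw, scaled = r2_scaled)
```

For judging whether GDP matters, the p-value and R-squared reported by `summary()` are more informative than the raw magnitude of the coefficient.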
Conclusion
In summary, this report’s findings do not deviate from what is already well known. Global temperatures have been rising steadily because of climate change driven by human activity. We hypothesized that the sharpest increase in temperatures occurred within the last 50 years, and the data support this. However, the deviations in temperature are not uniform; some countries have been affected more than others. Research on climate change suggests that high-latitude countries are seeing larger temperature deviations. Many factors contribute to these shifts, for example a positive feedback in which melting ice exposes darker land and water that absorb the sun’s warmth rather than reflecting it as ice does. This is consistent with colder countries seeing larger shifts in weather and temperature patterns.
Furthermore, this report attempted to determine the relationship between GDP and temperature deviations. We hypothesized that countries with lower GDPs would have higher departures and, thus, be more susceptible to climate change. However, the linear model did not show a clear correlation between these two variables. The model likely lacks complexity; there are many interacting variables, such as the human development index (HDI), each of which could influence the extent to which temperatures deviate. Nevertheless, the findings may serve as the foundation for further research.